23 research outputs found

Motif Discovery through Predictive Modeling of Gene Regulation

Author: A. Battle
A.P. Gasch
C.E. Lawrence
E. Segal
E. Wingender
E.M. Conlon
G.Z. Hertz
H.J. Bussemaker
J.D. Hughes
N. Slonim
R.E. Schapire
T. Cover
T.I. Lee
T.L. Bailey
Y. Pilpel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

We present MEDUSA, an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting. At each iteration of the algorithm, MEDUSA builds a motif model whose presence in the promoter region of a gene, coupled with activity of a regulator in an experiment, is predictive of differential expression. In this way, we learn motifs that are functional and predictive of regulatory response rather than motifs that are simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model of the transcriptional control logic that can predict the expression of any gene in the organism, given the sequence of the promoter region of the target gene and the expression state of a set of known or putative transcription factors and signaling molecules. Each motif model is either a k-length sequence, a dimer, or a PSSM that is built by agglomerative probabilistic clustering of sequences with similar boosting loss. By applying MEDUSA to a set of environmental stress response expression data in yeast, we learn motifs whose ability to predict differential expression of target genes outperforms motifs from the TRANSFAC dataset and from a previously published candidate set of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed binding sites associated with environmental stress response from the literature. Comment: RECOMB 200
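
The abstract pairs motif presence in a promoter with the activity of a regulator in an experiment. A minimal sketch of that kind of feature might look like the following. The function name, the example k-mer, the promoter strings, and the +1/-1/0 state encoding are all hypothetical choices for illustration. This is not the MEDUSA implementation, and no boosting or PSSM clustering is performed here.

```python
def motif_regulator_feature(promoter, kmer, regulator_state):
    """Hypothetical feature: the regulator's state if the k-mer occurs, else 0.

    regulator_state is assumed to be +1 (up), -1 (down), or 0 (baseline).
    """
    return regulator_state if kmer in promoter else 0


# Toy, made-up data: two promoters and one regulator observed in two experiments.
promoters = {"geneA": "TTAGGGGCAT", "geneB": "TTACCCTCAT"}
regulator_states = {"exp1": 1, "exp2": -1}

# One feature value per (gene, experiment) pair.
features = {
    (gene, exp): motif_regulator_feature(seq, "AGGGG", state)
    for gene, seq in promoters.items()
    for exp, state in regulator_states.items()
}
```

In a boosting setting, many such candidate features would compete at each round to explain up- or down-regulation labels. That selection step is omitted here.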

arXiv.org e-Print Archive

CiteSeerX

Crossref

Unsupervised learning of multiple motifs in biopolymers using expectation maximization

Author: A. Bairoch
A.P. Dempster
B. Crombrugghe de
C.B. Harley
C.E. Lawrence
Charles Elkan
D. Haussler
E.C. Uberbacher
G.D. Stormo
G.Z. Hertz
J.M. Varley
J.R. Quinlan
L. Breiman
L.F. Kolakowski
L.R. Cardon
O.G. Berg
T.L. Bailey
Timothy L. Bailey
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/1995

Crossref

Using PhyloCon to Identify Conserved Regulatory Motifs

Author: Hertz G.Z.
Stormo G.D.
Publication venue: 'Wiley'

Crossref

An Empirical Prior Improves Accuracy for Bayesian Estimation of Transcription Factor Binding Site Frequencies within Gene Promoters

Author: Hertz G.Z.
Heumann J.M.
Ramsey S.A.
Xing E.P.
Publication venue: 'SAGE Publications'

Crossref

New Bounds for Motif Finding in Strong Instances

Author: A. Panconesi
B. Brejova
C. McDiarmid
G.Z. Hertz
M. Li
W.J. Hoeffding
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Crossref

Discriminative Motifs

Author: Arnone M.I.
Grundy W.N.
Hertz G.Z.
Marsan L.
Pavlidis P.
Sinha S.
Tompa M.
Publication venue: 'Mary Ann Liebert Inc'

Crossref

PROJECTION Algorithm for Motif Finding on GPUs

Author: C. Chen
C. Lawrence
D. Kirk
G.Z. Hertz
K. Shida
S. Park
Y. Liu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Crossref

Monotone Scoring of Patterns with Mismatches

Author: A. Apostolico
C.E. Lawrence
G.Z. Hertz
I. Jonassen
J. Buhler
T.L. Bailey
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2004

We study the problem of extracting, from given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent queries: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k^2) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for probabilities and expected occurrences of a substring under unit increases in its length, allowed number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns
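
To make the quantity in this abstract concrete, here is a minimal sketch of the probability that a word w occurs at a fixed position with exactly j mismatches, for j up to k, when characters are drawn i.i.d. It uses a direct dynamic program over the characters of w, costing O(|w| k) per word. This is an assumption-laden illustration, not the O(nk) preprocessing or the O(k^2) query scheme the paper describes, and the function names are hypothetical.

```python
def mismatch_distribution(w, probs, k):
    """dist[j] = probability that a random string of length len(w), with
    characters drawn i.i.d. from probs, differs from w in exactly j
    positions, for j = 0..k."""
    dist = [1.0] + [0.0] * k
    for ch in w:
        p = probs[ch]                  # probability of matching this character
        new = [0.0] * (k + 1)
        for j in range(k + 1):
            new[j] = dist[j] * p       # this position matches
            if j > 0:
                new[j] += dist[j - 1] * (1.0 - p)  # this position mismatches
        dist = new
    return dist


def expected_occurrences(w, probs, k, n):
    """Expected number of start positions in an i.i.d. text of length n
    where w occurs with up to k mismatches (by linearity of expectation)."""
    return (n - len(w) + 1) * sum(mismatch_distribution(w, probs, k))


uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dist = mismatch_distribution("ACGT", uniform_dna, 1)
```

Summing the entries of the distribution gives the "up to k mismatches" probability, and a single entry gives the "exactly j mismatches" probability.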

Crossref

Archivio istituzionale della ricerca - Università di Padova

Application of Genetic Algorithms to the Genetic Regulation Problem

Author: A. Gonzalez
D.E. Goldberg
G.D. Stormo
G.Z. Hertz
J. Collado-Vides
M. Hernandez
V. Espinosa
Z. Michalewicz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

Crossref

Interface similarity improves comparison of DNA-binding proteins: the homeobox example

Author: A.G. Murzin
A.V. Morozov
B. Contreras-Moreira
E. Wingender
E.B. Lewis
G.Z. Hertz
H.M. Berman
M.B. Noyes
M.F. Berger
N.M. Luscombe
R.C. Edgar
S. Henikoff
S. Mahony
S.F. Altschul
T.L. Bailey
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

10th Spanish Symposium, JBI 2010, Torremolinos, Spain, October 27-29, 2010. Revised Selected Papers

The recently published 3D-footprint database contains an up-to-date repository of protein-DNA complexes of known structure that belong to different superfamilies and bind to DNA with distinct specificities. This repository can be scanned by means of sequence alignments in order to look for similar DNA-binding proteins, which might in turn recognize similar DNA motifs. Here we take the complete set of Homeobox proteins from Drosophila melanogaster and their preferred DNA motifs, which would fall in the largest 3D-footprint superfamily and were recently characterized by Noyes and collaborators, and annotate their interface residues. We then analyze the observed amino acid substitutions at equivalent interface positions and their effect on recognition. Finally we estimate to what extent interface similarity, computed over the set of residues which mediate DNA recognition, outperforms BLAST expectation values when deciding whether two aligned Homeobox proteins might bind to the same DNA motif.

This work was funded by Programa Euroinvestigación 2008 [EUI2008-03612]. Peer reviewe

Crossref

Digital.CSIC